Abstract

Mini-batch optimization has proven to be a powerful paradigm for large-scale learning. However, state-of-the-art
parallel mini-batch algorithms assume synchronous operation or cyclic update orders. When worker nodes
are heterogeneous (due to different computational capabilities or different communication delays), synchronous and 
cyclic operations are inefficient since they will leave workers idle waiting for the slower nodes to complete their 
computations. In this paper, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization 
problems with smooth loss functions that eliminates idle waiting and allows workers to run at their maximal update 
rates. We show that by suitably choosing the step-size values, the algorithm achieves a rate of the order $O(1/\sqrt{T})$
for general convex regularization functions, and the rate $O(1/T)$ for strongly convex regularization functions, where
T is the number of iterations. In both cases, the impact of asynchrony on the convergence rate of our algorithm is 
asymptotically negligible, and a near-linear speedup in the number of workers can be expected. Theoretical results 
are confirmed in real implementations on a distributed computing infrastructure. 


I. Introduction 

Many optimization problems that arise in machine learning, signal processing, and statistical estimation can be 
formulated as regularized stochastic optimization (also referred to as stochastic composite optimization) problems in 
which one jointly minimizes the expectation of a stochastic loss function plus a possibly nonsmooth regularization 
term. Examples include Tikhonov and elastic net regularization, Lasso, sparse logistic regression, and support vector
machines.

Stochastic approximation methods such as stochastic gradient descent were among the first algorithms developed 
for solving stochastic optimization problems. Recently, these methods have received significant attention due
to their simplicity and effectiveness. In particular, Nemirovski et al. demonstrated that
for nonsmooth stochastic convex optimization problems, a modified stochastic approximation method, the mirror
descent, exhibits an unimprovable convergence rate $O(1/\sqrt{T})$, where T is the number of iterations. Later, Lan
developed a mirror descent algorithm for stochastic composite convex problems which explicitly accounts for the
smoothness of the loss function and achieves the optimal rate. A similar result for the dual averaging method was
obtained by Xiao.

The methods for solving stochastic optimization problems cited above are inherently serial in the sense that the 
gradient computations take place on a single processor which has access to the whole dataset. However, it happens 
more and more often that one single computer is unable to store and handle the amounts of data that we encounter 
in practical problems. This has caused a strong interest in developing parallel optimization algorithms which are 
able to split the data and distribute the computation across multiple processors or multiple computer clusters (see, 
e.g., and references therein). 

One simple and popular stochastic approximation method is mini-batching, where iterates are updated based on 
the average gradient with respect to multiple data points rather than based on gradients evaluated at a single data point at
a time. Recently, Dekel et al. proposed a parallel mini-batch algorithm for regularized stochastic optimization
problems, in which multiple processors compute gradients in parallel using their own local data, and then aggregate 
the gradients up a spanning tree to obtain the averaged gradient. While this algorithm can achieve linear speedup 
in the number of processors, it has the drawback that the processors need to synchronize at each round and, hence, 
if one of them fails or is slower than the rest, then the entire algorithm runs at the pace of the slowest processor. 

In this paper, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization problems 
with smooth loss functions that eliminates the overhead associated with global synchronization. Our algorithm allows 
multiple processors to work at different rates, perform computations independently of each other, and update global 
decision variables using out-of-date gradients. A similar model of parallel asynchronous computation was applied 
to coordinate descent methods for deterministic optimization and to mirror descent and dual averaging
methods for stochastic optimization. In particular, Agarwal and Duchi have analyzed the convergence of
asynchronous mini-batch algorithms for smooth stochastic convex problems, and interestingly shown that bounded 
delays do not degrade the asymptotic convergence. However, they only considered the case where the regularization 
term is the indicator function of a compact convex set. 

We extend the results of Agarwal and Duchi to general regularization functions (like the $\ell_1$ norm, often used to promote sparsity),
and establish a sharper expected-value type of convergence rate than the one given in their work. Specifically, we make
the following contributions:

(i) For general convex regularization functions, we show that when the constraint set is closed and convex (but not
necessarily bounded), the running average of the iterates generated by our algorithm with constant step-sizes
converges at rate $O(1/T)$ to a ball around the optimum. We derive an explicit expression that quantifies how
the convergence rate and the residual error depend on loss function properties and algorithm parameters such
as the constant step-size, the batch size, and the maximum delay bound $\tau_{\max}$.

(ii) For general convex regularization functions and compact constraint sets, we prove that the running average of 
the iterates produced by our algorithm with a time-varying step-size converges to the true optimum (without 
This result improves upon the previously known rate

, ^max + 1 1 \ 

'v T ^ T + vry 

for delayed stochastic mirror descent methods with time-varying step-sizes. In this case, our
algorithm enjoys near-linear speedup as long as the number of processors is $O(T^{1/4})$.

(iii) When the regularization function is strongly convex and the constraint set is closed and convex, we establish
that the iterates converge at rate

$$O\left(\frac{(\tau_{\max}+1)^4}{T^2} + \frac{1}{T}\right).$$

If the number of processors is of the order of $O(T^{1/4})$, this rate is $O(1/T)$ asymptotically in T, which is the
best known rate for strongly convex stochastic optimization problems in a serial setting.

The remainder of the paper is organized as follows. In Section II, we introduce the notation and review some
preliminaries that are essential for the development of the results in this paper. In Section III, we formulate the
problem and discuss our assumptions. The proposed asynchronous mini-batch algorithm and its main theoretical
results are presented in Section IV. Computational experience is reported in Section V, while Section VI concludes
the paper.


II. Notation and Preliminaries

A. Notation 

We let $\mathbb{N}$ and $\mathbb{N}_0$ denote the set of natural numbers and the set of natural numbers including zero, respectively.
The inner product of two vectors $x, y \in \mathbb{R}^n$ is denoted by $\langle x, y \rangle$. We assume that $\mathbb{R}^n$ is endowed with a norm $\|\cdot\|$,
and use $\|\cdot\|_*$ to represent the corresponding dual norm, defined by

$$\|y\|_* = \sup_{\|x\| \le 1} \langle x, y \rangle.$$

B. Preliminaries 

Next, we review the key definitions and results necessary for developing the main results of this paper. We start 
with the definition of a Bregman distance function, also referred to as a prox-function. 

Definition 1: A function $\omega : X \to \mathbb{R}$ is called a distance generating function with modulus $\mu_\omega > 0$ with respect
to norm $\|\cdot\|$, if $\omega$ is continuously differentiable and $\mu_\omega$-strongly convex with respect to $\|\cdot\|$ over the set $X \subseteq \mathbb{R}^n$.
That is, for all $x, y \in X$,

$$\omega(y) \ge \omega(x) + \langle \nabla\omega(x), y - x \rangle + \frac{\mu_\omega}{2}\|y - x\|^2.$$

Every distance generating function introduces a corresponding Bregman distance function

$$D_\omega(x, y) := \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle.$$


For example, choosing $\omega(x) = \frac{1}{2}\|x\|_2^2$, which is 1-strongly convex with respect to the $\ell_2$-norm over any convex
set X, would result in $D_\omega(x,y) = \frac{1}{2}\|x - y\|_2^2$. Another common example of distance generating functions is the
entropy function

$$\omega(x) = \sum_{i=1}^{n} x_i \log x_i,$$

which is 1-strongly convex with respect to the $\ell_1$-norm over the standard simplex

$$\Delta := \left\{ x \in \mathbb{R}^n \;\middle|\; \sum_{i=1}^{n} x_i = 1,\; x \ge 0 \right\},$$

and its associated Bregman distance function is

$$D_\omega(x,y) = \sum_{i=1}^{n} y_i \log\frac{y_i}{x_i}.$$

The main motivation to use a generalized distance generating function, instead of the usual Euclidean distance
function, is to design optimization algorithms that can take advantage of the geometry of the feasible set.
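The two examples above can be checked numerically. The following is a minimal sketch in plain Python, assuming the convention $D_\omega(x,y) = \omega(y) - \omega(x) - \langle\nabla\omega(x), y - x\rangle$ from Definition 1; the function names and the test vectors are hypothetical and chosen only for illustration.

```python
import math

def bregman(omega, grad_omega, x, y):
    """D_omega(x, y) = omega(y) - omega(x) - <grad omega(x), y - x>."""
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_omega(x), x, y))
    return omega(y) - omega(x) - inner

# Euclidean distance generating function: omega(x) = 0.5 * ||x||_2^2
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def grad_half_sq_norm(x):
    return list(x)

# Entropy distance generating function: omega(x) = sum_i x_i log x_i
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

# Two illustrative points in the interior of the standard simplex.
x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]

d_euclid = bregman(half_sq_norm, grad_half_sq_norm, x, y)
d_entropy = bregman(neg_entropy, grad_neg_entropy, x, y)

# Closed forms stated in the text.
half_sq_dist = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
kl = sum(yi * math.log(yi / xi) for xi, yi in zip(x, y))
l1_dist = sum(abs(xi - yi) for xi, yi in zip(x, y))
```

On the simplex the linear terms cancel, which is why the entropy case collapses to the sum of $y_i \log(y_i/x_i)$; the last quantity, `l1_dist`, lets one also confirm the 1-strong convexity lower bound $D_\omega(x,y) \ge \frac{1}{2}\|y-x\|_1^2$ for this pair of points.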

Remark 1: The strong convexity of the distance generating function $\omega$ always ensures that

$$D_\omega(x,y) \ge \frac{\mu_\omega}{2}\|y - x\|^2, \quad \forall x, y \in X,$$

and $D_\omega(x, y) = 0$ if and only if $x = y$.

Remark 2: Throughout the paper, there is no loss of generality to assume that $\mu_\omega = 1$. Indeed, if $\mu_\omega \ne 1$, we
can choose the scaled function $\bar{\omega}(x) = \frac{1}{\mu_\omega}\omega(x)$, which has modulus $\bar{\mu}_\omega = 1$, to generate the Bregman distance
function.

The following definition introduces subgradients of proper convex functions.

Definition 2: For a convex function $\Psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, a vector $s \in \mathbb{R}^n$ is called a subgradient of $\Psi$ at
$x \in \mathbb{R}^n$ if

$$\Psi(y) \ge \Psi(x) + \langle s, y - x \rangle, \quad \forall y \in \mathbb{R}^n.$$

The set of all subgradients of $\Psi$ at x is called the subdifferential of $\Psi$ at x, and is denoted by $\partial\Psi(x)$.


III. Problem Setup 
We consider stochastic convex optimization problems of the form

$$\underset{x}{\text{minimize}} \quad \phi(x) := \mathbb{E}_\xi[F(x,\xi)] + \Psi(x). \tag{1}$$

Here, $x \in \mathbb{R}^n$ is the decision variable, $\xi$ is a random vector whose probability distribution $\mathcal{P}$ is supported on a set
$\Xi \subseteq \mathbb{R}^m$, $F(\cdot,\xi)$ is convex and differentiable for each $\xi \in \Xi$, and $\Psi(x)$ is a proper convex function that may be
nonsmooth and extended real-valued. Let us define

$$f(x) := \mathbb{E}_\xi[F(x,\xi)] = \int_\Xi F(x,\xi)\, d\mathcal{P}(\xi). \tag{2}$$

Note that the expectation function f is convex, differentiable, and $\nabla f(x) = \mathbb{E}_\xi[\nabla_x F(x,\xi)]$. We use $X^*$ to
denote the set of optimal solutions of Problem (1) and $\phi^*$ to denote the corresponding optimal value.

A difficulty when solving optimization problem (1) is that the distribution $\mathcal{P}$ is often unknown, so the expectation
cannot be computed. This situation occurs frequently in data-driven applications such as machine learning.
To support these applications, we do not assume knowledge of f (or of $\mathcal{P}$), only access to a stochastic oracle. Each
time the oracle is queried with an $x \in \mathbb{R}^n$, it generates an independent and identically distributed (i.i.d.) sample $\xi$
from $\mathcal{P}$ and returns $\nabla_x F(x,\xi)$.

We also impose the following assumptions on Problem (1).

Assumption 1 (Existence of a minimum): The optimal set $X^*$ is nonempty.

Assumption 2 (Lipschitz continuity of $\nabla F$): For each $\xi \in \Xi$, the function $F(\cdot,\xi)$ has Lipschitz continuous gradient
with constant L. That is, for all $y, z \in \mathbb{R}^n$,

$$\|\nabla_x F(y,\xi) - \nabla_x F(z,\xi)\|_* \le L\|y - z\|.$$

Note that under Assumption 2, $\nabla f(x)$ is also Lipschitz continuous with the same constant L.

Assumption 3 (Bounded gradient variance): There exists a constant $\sigma > 0$ such that

$$\mathbb{E}_\xi\left[\|\nabla_x F(x,\xi) - \nabla f(x)\|_*^2\right] \le \sigma^2, \quad \forall x \in \mathbb{R}^n.$$

Assumption 4 (Closed effective domain of $\Psi$): The function $\Psi$ is simple and lower semi-continuous, and its effective
domain, dom $\Psi = \{x \in \mathbb{R}^n \mid \Psi(x) < +\infty\}$, is closed.

Possible choices of $\Psi$ include:

• Unconstrained smooth minimization: $\Psi(x) = 0$.

• Constrained smooth minimization: $\Psi$ is the indicator function of a non-empty closed convex set $C \subseteq \mathbb{R}^n$, i.e.,

$$\Psi(x) = I_C(x) := \begin{cases} 0, & \text{if } x \in C, \\ +\infty, & \text{otherwise.} \end{cases}$$

• $\ell_1$-regularized minimization: $\Psi(x) = \lambda\|x\|_1$ with $\lambda > 0$.

• Constrained $\ell_1$-regularized minimization: In this case, $\Psi(x) = \lambda\|x\|_1 + I_C(x)$ with $\lambda > 0$.

Several practical problems in machine learning, statistical applications, and signal processing satisfy Assumptions 1-4.
One such example is $\ell_1$-regularized logistic regression for sparse binary classification.
We are then given a large number of observations

$$\{\xi_j = (a_j, b_j) \mid a_j \in \mathbb{R}^n,\; b_j \in \{-1, +1\},\; j = 1, \ldots, m\},$$

and would like to solve Problem (1) with the logistic loss $F(x, \xi_j) = \log\left(1 + \exp(-b_j\langle a_j, x\rangle)\right)$
and $\Psi(x) = \lambda\|x\|_1$. The role of $\ell_1$ regularization is to produce sparse solutions.

One approach for solving Problem (1) is the serial mini-batch method based on the mirror descent scheme.
Given a point $x \in$ dom $\Psi$, a single processor updates the decision variable x by sampling b i.i.d. random variables
$\xi_1, \ldots, \xi_b$ from $\mathcal{P}$, computing the averaged stochastic gradient

$$g_{\mathrm{ave}} = \frac{1}{b}\sum_{i=1}^{b} \nabla_x F(x, \xi_i),$$

and performing the composite mirror descent update

$$x \leftarrow \underset{z}{\operatorname{argmin}} \left\{ \langle g_{\mathrm{ave}}, z \rangle + \Psi(z) + \frac{1}{\gamma} D_\omega(x, z) \right\},$$

where $\gamma$ is a positive step-size parameter. Under Assumptions 1-4 and choosing an appropriate step-size, this
algorithm is guaranteed to converge to the optimum [32, Theorem 9]. However, in many emerging applications,
such as large-scale machine learning and statistics, the dataset is so large that it cannot fit on one machine.
Hence, we need optimization algorithms that can be conveniently and efficiently executed in parallel on multiple
processors.
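To make the serial update concrete, the following minimal sketch specializes it to the Euclidean case $\omega(x) = \frac{1}{2}\|x\|_2^2$ with $\Psi(x) = \lambda\|x\|_1$, where the argmin has the closed form of a soft-thresholding step. The least-squares loss, the synthetic data model, and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import random

def soft_threshold(v, t):
    """Minimizer of t*||z||_1 + 0.5*||z - v||_2^2 (componentwise shrinkage)."""
    return [(abs(vi) - t) * (1.0 if vi > 0 else -1.0) if abs(vi) > t else 0.0
            for vi in v]

def averaged_gradient(x, batch):
    """g_ave = (1/b) sum_i grad_x F(x, xi_i) for the assumed loss 0.5*(<a, x> - y)^2."""
    g = [0.0] * len(x)
    for a, y in batch:
        residual = sum(ai * xi for ai, xi in zip(a, x)) - y
        for j, aj in enumerate(a):
            g[j] += residual * aj / len(batch)
    return g

def serial_minibatch(sample, x0, gamma, lam, b, iters):
    """Serial mini-batch composite mirror descent with Euclidean omega."""
    x = list(x0)
    for _ in range(iters):
        batch = [sample() for _ in range(b)]          # b i.i.d. samples
        g = averaged_gradient(x, batch)               # averaged stochastic gradient
        step = [xi - gamma * gi for xi, gi in zip(x, g)]
        x = soft_threshold(step, gamma * lam)         # composite (proximal) update
    return x

# Hypothetical synthetic oracle: noiseless linear measurements of a sparse vector.
rng = random.Random(0)
x_true = [1.0, 0.0, -2.0]

def sample():
    a = [rng.uniform(-1.0, 1.0) for _ in x_true]
    return a, sum(ai * xi for ai, xi in zip(a, x_true))

x_hat = serial_minibatch(sample, [0.0, 0.0, 0.0], gamma=0.3, lam=0.01, b=10, iters=500)
```

With a non-Euclidean $\omega$ (for instance the entropy function on the simplex), only the last line of the loop changes; the sampling and averaging steps stay the same.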


IV. An Asynchronous Mini-Batch Algorithm 

In this section, we will present an asynchronous mini-batch algorithm that exploits multiple processors to solve 
Problem (1). We characterize the iteration complexity and the convergence rate of the proposed algorithm, and
show that these compare favourably with the state of the art. 


A. Description of Algorithm 


We assume p processors have access to a shared memory for the decision variable x. The processors may have 
different capabilities (in terms of processing power and access to data) and are able to update x without the need for 
coordination or synchronization. Conceptually, the algorithm lets each processor run its own stochastic composite 
mirror descent process, repeating the following steps: 

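1) Read x from the shared memory and load it into the local storage location $\hat{x}$;

2) Sample b i.i.d. random variables $\xi_1, \ldots, \xi_b$ from the distribution $\mathcal{P}$;

3) Compute the averaged stochastic gradient vector

$$g_{\mathrm{ave}} = \frac{1}{b}\sum_{i=1}^{b} \nabla_x F(\hat{x}, \xi_i);$$

4) Update current x in the shared memory via

$$x \leftarrow \underset{z}{\operatorname{argmin}} \left\{ \langle g_{\mathrm{ave}}, z \rangle + \Psi(z) + \frac{1}{\gamma} D_\omega(x, z) \right\}.$$
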
The algorithm can be implemented in many ways, as depicted in Figure 1. One way is to consider the p processors
as peers that each execute the four-step algorithm independently of each other and only share the global memory
for storing x. In this case, each processor reads the decision vector twice in each round: once in the first step
(before evaluating the averaged gradient), and once in the last step (before carrying out the minimization). To
ensure correctness, Step 4 must be an atomic operation, where the executing processor puts a write lock on the
global memory until it has written back the result of the minimization (cf. Figure 1, left). The algorithm can also
be executed in a master-worker setting. In this case, each of the worker nodes retrieves x from the master in Step 1
and returns the averaged gradient to the master in Step 3; the fourth step (carrying out the minimization) is executed
by the master (cf. Figure 1, right).






Fig. 1. Illustration of two conceptually different realizations of Algorithm 1: (1) a shared memory implementation (left); (2) a master-worker
implementation (right). In the shared memory setting shown to the left, processor $P_2$ reads $x(2)$ from the shared memory and computes the
averaged gradient vector $g_{\mathrm{ave}}(2) = \frac{1}{b}\sum_{i=1}^{b} \nabla_x F(x(2), \xi_i)$. As the processors are being run without synchronization, $x(3)$ and $x(4)$ are
written to the shared memory by other processors while $P_2$ is evaluating $g_{\mathrm{ave}}(2)$. The figure shows a snapshot of the algorithm at time instance
$k = 5$, at which the shared memory is locked by $P_2$ to read the current x, i.e. $x(4)$, to update it using the out-of-date gradient $g_{\mathrm{ave}}(2)$, and
write $x(5)$ to the memory. In the master-worker setting illustrated to the right, workers evaluate averaged gradient vectors in parallel and send
their computations to buffers on the master processor, which is the sole entity with access to the global memory. The master performs an update
using (possibly) out-of-date gradients and passes the updated decision vector x back to the workers.


Independently of how we choose to implement the algorithm, processors may work at different rates: while one
processor updates the decision vector (in the shared memory setting) or sends its averaged gradient to the master
(in the master-worker setting), the others are generally busy computing averaged gradient vectors. The processors
that perform gradient evaluations do not need to be aware of updates to the decision vector, but can continue to
operate on stale information about x. Therefore, unlike synchronous parallel mini-batch algorithms, there is
no need for processors to wait for each other to finish the gradient computations. Moreover, the value x at which
the average of gradients is evaluated by a processor may differ from the value of x to which the update is applied.

Algorithm 1 Asynchronous Mini-batch Algorithm (running on each processor)
1: Inputs: positive step-sizes $\{\gamma(k)\}_{k \in \mathbb{N}_0}$; batch size $b \in \mathbb{N}$.
2: Initialization: $x(0) \in$ dom $\Psi$; $k = 0$.
3: repeat
4: receive inputs $\xi_1, \ldots, \xi_b$ sampled i.i.d. from distribution $\mathcal{P}$;

$$g_{\mathrm{ave}}(d(k)) \leftarrow \frac{1}{b}\sum_{i=1}^{b} \nabla_x F(x(d(k)), \xi_i);$$

$$x(k+1) \leftarrow \underset{z}{\operatorname{argmin}} \left\{ \langle g_{\mathrm{ave}}(d(k)), z \rangle + \Psi(z) + \frac{1}{\gamma(k)} D_\omega(x(k), z) \right\}; \tag{3}$$

$$k \leftarrow k + 1;$$

5: until termination test satisfied

Algorithm 1 describes the p asynchronous processes that run in parallel. To describe the progress of the overall
optimization process, we introduce a counter k that is incremented each time x is updated. We let $d(k)$ denote the
time at which the x used to compute the averaged gradient involved in the update of $x(k)$ was read from the shared
memory. It is clear that $0 \le d(k) \le k$ for all $k \in \mathbb{N}_0$. The value

$$\tau(k) := k - d(k)$$

can be viewed as the delay between reading and updating for processors and captures the staleness of the information
used to compute the average of gradients for the k-th update. We assume that the delay is not too long, i.e., there
is a nonnegative integer $\tau_{\max}$ such that

$$0 \le \tau(k) \le \tau_{\max}.$$

The value of $\tau_{\max}$ is an indicator of the asynchronism in the algorithm and in the execution platform. In practice,
$\tau_{\max}$ will depend on the number of parallel processors used in the algorithm. Note that the cyclic-delay
mini-batch algorithm, in which the processors are ordered and each updates the decision variable under a fixed
schedule, is a special case of Algorithm 1 where $d(k) = k - p + 1$, or, equivalently, $\tau(k) = p - 1$ for all k.
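The delay model can be illustrated with a single-threaded simulation in which the gradient used for the k-th update is evaluated at a stale iterate $x(d(k))$ with $k - d(k) \le \tau_{\max}$. This is only a sketch under stated assumptions: the delays are drawn uniformly at random (real delays depend on the platform), $\omega$ is Euclidean, $\Psi = \lambda\|\cdot\|_1$, the loss is least squares on synthetic data, and the step-size is a conservative constant below $1/(L(\tau_{\max}+1)^2)$. None of the names below come from the paper.

```python
import random
from collections import deque

def soft_threshold(v, t):
    """Minimizer of t*||z||_1 + 0.5*||z - v||_2^2 (componentwise shrinkage)."""
    return [(abs(vi) - t) * (1.0 if vi > 0 else -1.0) if abs(vi) > t else 0.0
            for vi in v]

def async_minibatch(sample, grad, x0, gamma, lam, b, tau_max, iters, rng):
    """Simulate updates x(k+1) computed from gradients evaluated at x(d(k))."""
    recent = deque([list(x0)], maxlen=tau_max + 1)   # keeps x(k - tau_max), ..., x(k)
    for _ in range(iters):
        tau = rng.randint(0, len(recent) - 1)        # delay tau(k) <= min(k, tau_max)
        stale = recent[-1 - tau]                     # x(d(k)) with d(k) = k - tau(k)
        batch = [sample() for _ in range(b)]
        grads = [grad(stale, xi) for xi in batch]
        g_ave = [sum(col) / b for col in zip(*grads)]
        x = recent[-1]                               # current iterate x(k)
        step = [xi - gamma * gi for xi, gi in zip(x, g_ave)]
        recent.append(soft_threshold(step, gamma * lam))
    return recent[-1]

# Hypothetical synthetic oracle: noiseless linear measurements of a sparse vector.
rng = random.Random(1)
x_true = [1.0, 0.0, -2.0]

def sample():
    a = [rng.uniform(-1.0, 1.0) for _ in x_true]
    return a, sum(ai * xi for ai, xi in zip(a, x_true))

def grad(x, xi):
    a, y = xi
    residual = sum(ai * xj for ai, xj in zip(a, x)) - y
    return [residual * ai for ai in a]

L = 3.0            # ||a||_2^2 <= 3 for a in [-1, 1]^3
tau_max = 4
gamma = 0.9 / (L * (tau_max + 1) ** 2)
x_async = async_minibatch(sample, grad, [0.0] * 3, gamma, 0.01, 10, tau_max, 4000, rng)
```

Setting `tau_max = 0` recovers the serial mini-batch method, which makes the sketch a convenient way to compare the effect of staleness on the same problem instance.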

B. Convergence Rate for General Convex Regularization 

The following theorem establishes convergence properties of Algorithm 1 when a constant step-size is used.

Theorem 1: Let Assumptions 1-4 hold. Assume also that for all $k \in \mathbb{N}_0$,

$$\gamma(k) = \gamma \in \left(0, \frac{1}{L(\tau_{\max}+1)^2}\right). \tag{4}$$

Then, for every $T \in \mathbb{N}$ and any optimizer $x^*$ of (1), we have

$$\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^* \le \frac{D_\omega(x(0), x^*)}{\gamma T} + \frac{\gamma c \sigma^2}{2b\left(1 - \gamma L(\tau_{\max}+1)^2\right)},$$

where $x_{\mathrm{ave}}(T)$ is the Cesàro average of the iterates, i.e.,

$$x_{\mathrm{ave}}(T) := \frac{1}{T}\sum_{k=1}^{T} x(k).$$

Furthermore, b is the batch size, the expectation is taken with respect to all random variables
$\{\xi_i(k) \mid i = 1, \ldots, b,\; k = 0, \ldots, T-1\}$, and $c \in [1, b]$ is given by

$$c = \begin{cases} 1, & \text{if } \|\cdot\|_* = \|\cdot\|_2, \\ 2\max_{\|x\| \le 1} \omega(x), & \text{otherwise.} \end{cases}$$

Proof: See Appendix A. ■

Theorem 1 demonstrates that for any constant step-size $\gamma$ satisfying the step-size condition in Theorem 1, the running average of iterates generated
by Algorithm 1 will converge in expectation to a ball around the optimum at a rate of $O(1/T)$. The convergence
rate and the residual error depend on the choice of $\gamma$: decreasing $\gamma$ reduces the residual error, but it also results in
a slower convergence. We now describe a possible strategy for selecting the constant step-size. Let $T_\epsilon$ be the total
number of iterations necessary to achieve an $\epsilon$-optimal solution to Problem (1), that is, $\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^* \le \epsilon$ when
$T \ge T_\epsilon$. If we pick

$$\gamma = \frac{\epsilon}{L\epsilon(\tau_{\max}+1)^2 + c\sigma^2/b}, \tag{5}$$

it follows from Theorem 1 that the corresponding $x_{\mathrm{ave}}(T)$ satisfies

$$\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^* \le \frac{\epsilon_0}{T}\left(L(\tau_{\max}+1)^2 + \frac{c\sigma^2}{b\epsilon}\right) + \frac{\epsilon}{2},$$

where $\epsilon_0 = D_\omega(x(0), x^*)$. This inequality tells us that if the first term on the right-hand side is less than $\epsilon/2$, i.e.,
if

$$T \ge T_\epsilon := 2\epsilon_0\left(\frac{L(\tau_{\max}+1)^2}{\epsilon} + \frac{c\sigma^2}{b\epsilon^2}\right),$$

then $\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^* \le \epsilon$. Hence, the iteration complexity of Algorithm 1 with the step-size choice (5) is given
by

$$O\left(\frac{L(\tau_{\max}+1)^2}{\epsilon} + \frac{c\sigma^2}{b\epsilon^2}\right). \tag{6}$$


As long as the maximum delay bound $\tau_{\max}$ is of the order $1/\sqrt{\epsilon}$, the first term in (6) is asymptotically negligible,
and hence the iteration complexity of Algorithm 1 is asymptotically $O(c\sigma^2/(b\epsilon^2))$, which is exactly the iteration
complexity achieved by the mini-batch algorithm for solving stochastic convex optimization problems in a serial
setting [32]. As discussed before, $\tau_{\max}$ is related to the number of processors used in the algorithm. Therefore,
if the number of processors is of the order of $O(1/\sqrt{\epsilon})$, parallelization does not appreciably degrade asymptotic
convergence of Algorithm 1. Furthermore, as p processors are being run in parallel, updates occur roughly p times
as quickly and in time scaling as $T/p$, the processors may compute T averaged gradient vectors (instead of $T/p$
vectors). This means that a near-linear speedup in the number of processors can be expected.
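The step-size rule and the resulting iteration count are easy to tabulate. The sketch below assumes that the bound of Theorem 1 has the form $D_\omega(x(0),x^*)/(\gamma T) + \gamma c\sigma^2/\bigl(2b(1-\gamma L(\tau_{\max}+1)^2)\bigr)$, that the step-size is $\gamma = \epsilon/(L\epsilon(\tau_{\max}+1)^2 + c\sigma^2/b)$, and that $T_\epsilon = 2\epsilon_0\bigl(L(\tau_{\max}+1)^2/\epsilon + c\sigma^2/(b\epsilon^2)\bigr)$. The function names and the numerical values are hypothetical and serve only to exercise the formulas.

```python
def constant_step_size(eps, L, tau_max, c, sigma, b):
    """Step-size gamma = eps / (L*eps*(tau_max+1)^2 + c*sigma^2/b)."""
    return eps / (L * eps * (tau_max + 1) ** 2 + c * sigma ** 2 / b)

def theorem1_bound(T, gamma, d0, L, tau_max, c, sigma, b):
    """Assumed form of the bound: d0/(gamma*T) + residual error term."""
    residual = gamma * c * sigma ** 2 / (2 * b * (1 - gamma * L * (tau_max + 1) ** 2))
    return d0 / (gamma * T) + residual

def iterations_needed(eps, d0, L, tau_max, c, sigma, b):
    """T_eps = 2*d0*(L*(tau_max+1)^2/eps + c*sigma^2/(b*eps^2))."""
    return 2 * d0 * (L * (tau_max + 1) ** 2 / eps + c * sigma ** 2 / (b * eps ** 2))

# Illustrative (made-up) problem constants.
eps, d0, L, tau_max, c, sigma, b = 0.01, 5.0, 1.0, 3, 1.0, 2.0, 10

gamma = constant_step_size(eps, L, tau_max, c, sigma, b)
T_eps = iterations_needed(eps, d0, L, tau_max, c, sigma, b)
bound_at_T_eps = theorem1_bound(T_eps, gamma, d0, L, tau_max, c, sigma, b)

# Share of the iteration count attributable to the delay term.
delay_share = (2 * d0 * L * (tau_max + 1) ** 2 / eps) / T_eps
```

With this choice both terms of the bound equal $\epsilon/2$ at $T = T_\epsilon$, so `bound_at_T_eps` coincides with `eps`; `delay_share` shows how much of $T_\epsilon$ comes from the $(\tau_{\max}+1)^2$ term for the chosen constants.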

Remark 3: Another strategy for the selection of the constant step-size in Algorithm 1 is to use a $\gamma$ that depends
on the prior knowledge of the number of iterations to be performed. More precisely, assume that the number of
iterations is fixed in advance, say equal to $T_F$. By choosing $\gamma$ as

$$\gamma = \frac{1}{L(\tau_{\max}+1)^2 + \alpha\sqrt{T_F}}$$

for some $\alpha > 0$, it follows from Theorem 1 that the running average of the iterates after $T_F$ iterations satisfies

$$\mathbb{E}[\phi(x_{\mathrm{ave}}(T_F))] - \phi^* \le \frac{L(\tau_{\max}+1)^2 D_\omega(x(0),x^*)}{T_F} + \frac{1}{\sqrt{T_F}}\left(\alpha D_\omega(x(0),x^*) + \frac{c\sigma^2}{2b\alpha}\right).$$

It is easy to verify that the optimal choice of $\alpha$, which minimizes the second term on the right-hand side of the
above inequality, is

$$\alpha^* = \frac{\sigma\sqrt{c}}{\sqrt{2bD_\omega(x(0),x^*)}}.$$

With this choice of $\alpha$, we then have

$$\mathbb{E}[\phi(x_{\mathrm{ave}}(T_F))] - \phi^* \le \frac{L(\tau_{\max}+1)^2 D_\omega(x(0),x^*)}{T_F} + \frac{\sigma\sqrt{2cD_\omega(x(0),x^*)}}{\sqrt{bT_F}}.$$

In the case that $\tau_{\max} = 0$, the preceding guaranteed bound reduces to the bound previously obtained for the
serial stochastic mirror descent algorithm with constant step-sizes. Note that in order to implement Algorithm 1 with
the optimal constant step-size policy, we need to estimate an upper bound on $D_\omega(x(0),x^*)$, since $D_\omega(x(0),x^*)$
is usually unknown.
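For completeness, the optimal $\alpha$ in Remark 3 can be obtained by elementary calculus. The short derivation below is a sketch that writes $D := D_\omega(x(0),x^*)$ and assumes the second term of the bound is $\frac{1}{\sqrt{T_F}}\bigl(\alpha D + \frac{c\sigma^2}{2b\alpha}\bigr)$; the symbol $h$ is introduced here only for illustration.

```latex
\begin{align*}
h(\alpha) &:= \alpha D + \frac{c\sigma^2}{2b\alpha}, \qquad \alpha > 0, \\
h'(\alpha) &= D - \frac{c\sigma^2}{2b\alpha^2} = 0
  \;\Longrightarrow\; \alpha^* = \frac{\sigma\sqrt{c}}{\sqrt{2bD}}, \\
h(\alpha^*) &= \frac{\sigma\sqrt{cD}}{\sqrt{2b}} + \frac{\sigma\sqrt{cD}}{\sqrt{2b}}
  = \sigma\sqrt{\frac{2cD}{b}}.
\end{align*}
```

Since $h$ is convex on $\alpha > 0$, the stationary point is the global minimizer, and dividing $h(\alpha^*)$ by $\sqrt{T_F}$ gives the second term $\sigma\sqrt{2cD}/\sqrt{bT_F}$ of the final bound.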

The following theorem characterizes the convergence of Algorithm 1 with a time-varying step-size sequence
when dom $\Psi$ is bounded in addition to being closed and convex.

Theorem 2: Suppose that Assumptions 1-4 hold. In addition, suppose that dom $\Psi$ is compact and that $D_\omega(\cdot,\cdot)$
is bounded on dom $\Psi$. Let

$$R^2 = \max_{x,y \in \mathrm{dom}\,\Psi} D_\omega(x,y).$$

If $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ is set to $\gamma(k)^{-1} = L(\tau_{\max}+1)^2 + \alpha(k)$ with

$$\alpha(k) = \frac{\sigma\sqrt{c}\sqrt{k+1}}{R\sqrt{b}},$$

then the Cesàro average of the iterates generated by Algorithm 1 satisfies

$$\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^* \le \frac{LR^2(\tau_{\max}+1)^2}{T} + \frac{2\sigma R\sqrt{c}}{\sqrt{bT}}$$

for all $T \in \mathbb{N}$.

Proof: See Appendix B. ■

The time-varying step-size $\gamma(k)$, which ensures the convergence of the algorithm, consists of two terms: the
time-varying term $\alpha(k)$ should control the errors from stochastic gradient information while the role of the constant
term $L(\tau_{\max}+1)^2$ is to decrease the effects of asynchrony (bounded delays) on the convergence of the algorithm.
According to Theorem 2, in the case that $\tau_{\max} = O(T^{1/4})$, the delay becomes increasingly harmless as the algorithm
progresses and the expected function value evaluated at $x_{\mathrm{ave}}(T)$ converges asymptotically at a rate $O(1/\sqrt{T})$, which
is known to be the best achievable rate of the mirror descent method for nonsmooth stochastic convex optimization
problems.
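The two-term structure of the time-varying step-size can be made explicit in a few lines. This sketch assumes the schedule $\gamma(k)^{-1} = L(\tau_{\max}+1)^2 + \alpha(k)$ with $\alpha(k) = \sigma\sqrt{c}\sqrt{k+1}/(R\sqrt{b})$; the function name and the constants are hypothetical.

```python
import math

def time_varying_step_size(k, L, tau_max, sigma, c, R, b):
    """gamma(k) = 1 / (L*(tau_max+1)^2 + alpha(k)), alpha(k) growing like sqrt(k+1)."""
    delay_term = L * (tau_max + 1) ** 2                          # constant, handles asynchrony
    alpha_k = sigma * math.sqrt(c) * math.sqrt(k + 1) / (R * math.sqrt(b))  # handles gradient noise
    return 1.0 / (delay_term + alpha_k)

# Illustrative (made-up) constants.
L, tau_max, sigma, c, R, b = 1.0, 3, 2.0, 1.0, 10.0, 100

schedule = [time_varying_step_size(k, L, tau_max, sigma, c, R, b) for k in range(1000)]
noiseless = time_varying_step_size(0, L, tau_max, 0.0, c, R, b)
```

Early on the constant delay term dominates and the step-size stays close to $1/(L(\tau_{\max}+1)^2)$; for large k the $\sqrt{k+1}$ term takes over and the step-size decays like $1/\sqrt{k}$, as in the serial stochastic setting.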

For the special case of the optimization problem (1) where $\Psi$ is restricted to be the indicator function of a compact
convex set, Agarwal and Duchi showed (in Theorem 2 of their work) that the convergence rate of the delayed stochastic mirror
descent method with time-varying step-size is

-f i?G'rmax aRy/c LR'^ log T 

\ T Vn ca^T 

where G is the maximum bound on $\sqrt{\mathbb{E}[\|\nabla_x F(x,\xi)\|_*^2]}$. Comparing with this result, instead of an asymptotic penalty
of the form $O(\tau_{\max}^2 \log T / T)$ due to the delays, we have the penalty $O(\tau_{\max}^2 / T)$, which is much smaller for large
T. Therefore, not only do we extend the result of Agarwal and Duchi to general regularization functions, but we also obtain a
sharper guaranteed convergence rate than the one presented in their work.

C. Convergence Rate for Strongly Convex Regularization 

In this subsection, we restrict our attention to stochastic composite optimization problems with strongly convex
regularization terms. Specifically, we assume that $\Psi$ is $\mu_\Psi$-strongly convex with respect to $\|\cdot\|$, that is, for any
$x, y \in$ dom $\Psi$,

$$\Psi(y) \ge \Psi(x) + \langle s, y - x \rangle + \frac{\mu_\Psi}{2}\|y - x\|^2, \quad \forall s \in \partial\Psi(x).$$

Examples of the strongly convex function $\Psi$ include:

• $\ell_2$-regularization: $\Psi(x) = (\rho/2)\|x\|_2^2$ with $\rho > 0$.

• Elastic net regularization: $\Psi(x) = \lambda\|x\|_1 + (\rho/2)\|x\|_2^2$ with $\lambda > 0$ and $\rho > 0$.

Remark 4: The strong convexity of $\Psi$ implies that Problem (1) has a unique minimizer $x^*$ [41, Corollary 11.16].

In order to derive the convergence rate of Algorithm 1 for solving (1) with a strongly convex regularization term,
we need to assume that the Bregman distance function $D_\omega(x,y)$ used in the algorithm satisfies the next assumption.

Assumption 5 (Quadratic growth condition): For all $x, y \in$ dom $\Psi$, we have

$$D_\omega(x,y) \le \frac{Q}{2}\|x - y\|^2,$$

with $Q \ge \mu_\omega$.

For example, if $\omega(x) = \frac{1}{2}\|x\|_2^2$, then $D_\omega(x,y) = \frac{1}{2}\|x - y\|_2^2$ and $Q = 1$. Note that Assumption 5 will automatically
hold when the distance generating function $\omega$ has Lipschitz continuous gradient with a constant Q.
The associated convergence result now reads as follows. 
Theorem 3: Suppose that the regularization function $\Psi$ is $\mu_\Psi$-strongly convex and that Assumptions 1-5 hold.
If $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ is set to $\gamma(k)^{-1} = 2L(\tau_{\max}+1)^2 + \beta(k)$ with

$$\beta(k) = \frac{\mu_\Psi}{3Q}(k + \tau_{\max} + 1),$$

then the iterates produced by Algorithm 1 satisfy

$$\mathbb{E}[\|x(T) - x^*\|^2] \le \frac{2\left(\frac{6LQ}{\mu_\Psi} + 1\right)^2 (\tau_{\max}+1)^4}{(T+1)^2}\, D_\omega(x(0), x^*) + \frac{18c\sigma^2Q^2}{b\mu_\Psi^2(T+1)},$$

for all $T \in \mathbb{N}$.

Proof: See Appendix [C] ■ 

An interesting point regarding Theorem 3 is that for solving stochastic composite optimization problems with
strongly convex regularization functions, the maximum delay bound $\tau_{\max}$ can be as large as $O(T^{1/4})$ without
affecting the asymptotic convergence rate of Algorithm 1. In this case, our asynchronous mini-batch algorithm
converges asymptotically at a rate of $O(1/T)$, which matches the best known rate achievable in a serial setting.


V. Experimental Results 


We have developed a complete master-worker implementation of our algorithm in C/C++ using the Message
Passing Interface libraries (OpenMPI). Although we argued in Section IV that Algorithm 1 can be implemented
using atomic operations on shared-memory computing architectures, we have chosen the MPI implementation due
to its flexibility in scaling the problem to distributed-memory environments.

We evaluated our algorithm on a document classification problem using the text categorization dataset rcv1 [42].
This dataset consists of $m \approx 800000$ documents, with $n \approx 50000$ unique stemmed tokens spanning 103 topics. Out
of these topics, we decided to classify sports-related documents. To this end, we trained a sparse (binary) classifier
by solving the following $\ell_1$-regularized logistic regression problem

$$\begin{array}{ll} \text{minimize} & \mathbb{E}_\xi\left[\log\left(1 + \exp(-b_i\langle a_i, x\rangle)\right)\right] + \lambda\|x\|_1 \\ \text{subject to} & \|x\|_2 \le R. \end{array}$$

Here, $a_i \in \mathbb{R}^n$ is the sparse vector of token weights assigned to each document, and $b_i \in \{-1, 1\}$ indicates whether
a selected document is sports-related, or not ($b_i$ is 1 if the document is about sport, $-1$ otherwise). To evaluate
scalability, we used both the training and test sets available when solving the optimization problem. We implemented
Algorithm 1 with time-varying step-sizes, and used a batch size of 1000 documents. The regularization parameter
was set to $\lambda = 0.01$, and the algorithm was run until a fixed tolerance $\epsilon$ was met.

Figure 2 presents the achieved relative speedup of the algorithm with respect to the number of workers used.
The relative speedup of the algorithm on p processors is defined as $S(p) = t_1/t_p$, where $t_1$ and $t_p$ are the time it
takes to run the corresponding algorithm (to $\epsilon$-accuracy) on 1 and p processing units, respectively. We observe a
near-linear relative speedup, consistent with our theoretical results. The timings are averaged over 10 Monte Carlo
runs.

Fig. 2. Speedup of Algorithm 1 with respect to the number of workers.


VI. Conclusions 

We have proposed an asynchronous mini-batch algorithm that exploits multiple processors to solve regularized 
stochastic optimization problems with smooth loss functions. We have established that for closed and convex 
constraint sets, the iteration complexity of the algorithm with constant step-sizes is asymptotically $O(1/\epsilon^2)$. For
compact constraint sets, we have proved that the running average of the iterates generated by our algorithm with
time-varying step-size converges to the optimum at a rate $O(1/\sqrt{T})$. When the regularization function is strongly
convex and the constraint set is closed and convex, the algorithm achieves the rate of the order $O(1/T)$. We have
shown that the penalty in convergence rate of the algorithm due to asynchrony is asymptotically negligible and 
a near-linear speedup in the number of processors can be expected. Our computational experience confirmed the 
theory. 


Appendix 


In this section, we prove the main results of the paper, namely, Theorems 1-3. We first state three key lemmas
which are instrumental in our argument.

The following result establishes an important recursion for the iterates generated by Algorithm 1.

Lemma 1: Suppose Assumptions 1-4 hold. Then, the iterates $\{x(k)\}_{k \in \mathbb{N}_0}$ generated by Algorithm 1 satisfy

$$\begin{aligned}
\phi(x(k+1)) - \phi^* + \frac{1}{\gamma(k)} D_\omega(x(k+1), x^*) \le{}& \frac{1}{2\eta(k)}\|e(d(k))\|_*^2 + \langle e(d(k)), x(k) - x^* \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), x^*) \\
&+ \frac{L(\tau_{\max}+1)}{2}\sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&- \frac{\mu_\Psi}{2}\|x(k+1) - x^*\|^2 - \frac{1}{2}\left(\frac{1}{\gamma(k)} - \eta(k)\right)\|x(k+1) - x(k)\|^2,
\end{aligned} \tag{7}$$

where $x^* \in X^*$, $\{\eta(k)\}$ is a sequence of strictly positive numbers, and $e(k) := \nabla f(x(k)) - g_{\mathrm{ave}}(k)$ is the error in
the gradient estimate.

Proof: We start with the first-order optimality condition for the point $x(k+1)$ in the minimization problem that defines the update: 
there exists a subgradient $s(k+1) \in \partial\Psi(x(k+1))$ such that for all $z \in \operatorname{dom}\Psi$, we have 

$$
\Big\langle g_{\mathrm{ave}}(d(k)) + s(k+1) + \frac{1}{\gamma(k)} \nabla_{(2)} D_\omega(x(k), x(k+1)),\; z - x(k+1) \Big\rangle \ge 0,
$$

where $\nabla_{(2)} D_\omega(\cdot,\cdot)$ denotes the partial derivative of the Bregman distance function with respect to the second 
variable. Plugging the following equality 

$$
\nabla_{(2)} D_\omega(x(k), x(k+1)) = \nabla\omega(x(k+1)) - \nabla\omega(x(k)),
$$

into the previous inequality and re-arranging terms gives 

$$
\begin{aligned}
\frac{1}{\gamma(k)} \langle \nabla\omega(x(k)) - \nabla\omega(x(k+1)),\, z - x(k+1) \rangle
&\le \langle g_{\mathrm{ave}}(d(k)) + s(k+1),\, z - x(k+1) \rangle \\
&= \langle g_{\mathrm{ave}}(d(k)),\, z - x(k+1) \rangle + \langle s(k+1),\, z - x(k+1) \rangle \\
&\le \langle g_{\mathrm{ave}}(d(k)),\, z - x(k+1) \rangle + \Psi(z) - \Psi(x(k+1)) - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2,
\end{aligned}
\tag{8}
$$

where the last inequality used 

$$
\Psi(z) \ge \Psi(x(k+1)) + \langle s(k+1),\, z - x(k+1) \rangle + \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2,
$$

by the (strong) convexity of $\Psi$. We now use the following well-known three-point identity of the Bregman distance 
function [61] to rewrite the left-hand side of (8): 

$$
\langle \nabla\omega(a) - \nabla\omega(b),\, c - b \rangle = D_\omega(a, b) - D_\omega(a, c) + D_\omega(b, c).
$$

From this relation, with $a = x(k)$, $b = x(k+1)$, and $c = z$, we have 

$$
\langle \nabla\omega(x(k)) - \nabla\omega(x(k+1)),\, z - x(k+1) \rangle = D_\omega(x(k), x(k+1)) - D_\omega(x(k), z) + D_\omega(x(k+1), z).
$$
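The three-point identity can be sanity-checked numerically. The sketch below is illustrative only: it assumes the convention $D_\omega(x, y) = \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle$ (inferred from the partial-derivative formula used in the proof) and picks the negative entropy as a hypothetical distance generating function; the helper names are not from the paper.

```python
import math

def omega(x):
    # Negative entropy: an illustrative distance generating function on positive vectors.
    return sum(xi * math.log(xi) for xi in x)

def grad_omega(x):
    return [math.log(xi) + 1.0 for xi in x]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def bregman(x, y):
    # Assumed convention: D(x, y) = omega(y) - omega(x) - <grad omega(x), y - x>.
    return omega(y) - omega(x) - dot(grad_omega(x), [yi - xi for xi, yi in zip(x, y)])

def three_point_gap(a, b, c):
    # <grad omega(a) - grad omega(b), c - b> minus [D(a,b) - D(a,c) + D(b,c)].
    lhs = dot([p - q for p, q in zip(grad_omega(a), grad_omega(b))],
              [ci - bi for ci, bi in zip(c, b)])
    rhs = bregman(a, b) - bregman(a, c) + bregman(b, c)
    return lhs - rhs

a = [0.2, 0.5, 0.3]
b = [0.6, 0.1, 0.3]
c = [0.25, 0.25, 0.5]
gap = three_point_gap(a, b, c)  # should vanish up to floating-point rounding
```

Since the identity is purely algebraic, the gap should be zero for any differentiable $\omega$, not only for this choice.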

Substituting the preceding equality into (8) and re-arranging terms result in 

$$
\begin{aligned}
\Psi(x(k+1)) - \Psi(z) + \frac{1}{\gamma(k)} D_\omega(x(k+1), z)
&\le \langle g_{\mathrm{ave}}(d(k)),\, z - x(k+1) \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad - \frac{1}{\gamma(k)} D_\omega(x(k), x(k+1)) - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2.
\end{aligned}
$$

Since the distance generating function $\omega(x)$ is 1-strongly convex, we have the lower bound 

$$
D_\omega(x(k), x(k+1)) \ge \frac{1}{2} \|x(k+1) - x(k)\|^2,
$$

which implies that 

$$
\begin{aligned}
\Psi(x(k+1)) - \Psi(z) + \frac{1}{\gamma(k)} D_\omega(x(k+1), z)
&\le \langle g_{\mathrm{ave}}(d(k)),\, z - x(k+1) \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad - \frac{1}{2\gamma(k)} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2.
\end{aligned}
\tag{9}
$$

The essential idea in the rest of the proof is to use convexity and smoothness of the expectation function $f$ to 
bound $f(x(k+1)) - f(z)$ for each $z \in \operatorname{dom}\Psi$ and each $k \in \mathbb{N}_0$. By assumption, $\nabla_x F(x,\xi)$ and, 
hence, $\nabla f(x)$ are Lipschitz continuous with constant $L$. By using the $L$-Lipschitz continuity of $\nabla f$ and then 
the convexity of $f$, we have 


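$$
\begin{aligned}
f(x(k+1)) &\le f(x(d(k))) + \langle \nabla f(x(d(k))),\, x(k+1) - x(d(k)) \rangle + \frac{L}{2} \|x(k+1) - x(d(k))\|^2 \\
&\le f(z) + \langle \nabla f(x(d(k))),\, x(k+1) - z \rangle + \frac{L}{2} \|x(k+1) - x(d(k))\|^2,
\end{aligned}
\tag{10}
$$

for any $z \in \operatorname{dom}\Psi$. Combining inequalities (9) and (10), and recalling that $\phi(x) = f(x) + \Psi(x)$, we obtain 

$$
\begin{aligned}
\phi(x(k+1)) - \phi(z) + \frac{1}{\gamma(k)} D_\omega(x(k+1), z)
&\le \langle \nabla f(x(d(k))) - g_{\mathrm{ave}}(d(k)),\, x(k+1) - z \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad - \frac{1}{2\gamma(k)} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2 \\
&\quad + \frac{L}{2} \|x(k+1) - x(d(k))\|^2.
\end{aligned}
$$

We now rewrite the above inequality in terms of the error $e(d(k)) = \nabla f(x(d(k))) - g_{\mathrm{ave}}(d(k))$ as follows: 

$$
\begin{aligned}
\phi(x(k+1)) - \phi(z) + \frac{1}{\gamma(k)} D_\omega(x(k+1), z)
&\le \langle e(d(k)),\, x(k+1) - z \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad - \frac{1}{2\gamma(k)} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2 + \frac{L}{2} \|x(k+1) - x(d(k))\|^2 \\
&= \underbrace{\langle e(d(k)),\, x(k+1) - x(k) \rangle}_{U_1} + \langle e(d(k)),\, x(k) - z \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad - \frac{1}{2\gamma(k)} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2 + \frac{L}{2} \underbrace{\|x(k+1) - x(d(k))\|^2}_{U_2}.
\end{aligned}
\tag{11}
$$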

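We will seek upper bounds on the quantities $U_1$ and $U_2$. Let $\{\eta(k)\}_{k \in \mathbb{N}_0}$ be a sequence of positive numbers. For 
$U_1$, we have 

$$
U_1 \le \Big| \Big\langle \frac{1}{\sqrt{\eta(k)}}\, e(d(k)),\; \sqrt{\eta(k)}\, \big(x(k+1) - x(k)\big) \Big\rangle \Big|
\le \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \frac{\eta(k)}{2} \|x(k+1) - x(k)\|^2,
\tag{12}
$$

where the second inequality follows from the Fenchel-Young inequality applied to the conjugate pair $\frac{1}{2}\|\cdot\|^2$ and $\frac{1}{2}\|\cdot\|_*^2$, i.e., 

$$
|\langle a, b \rangle| \le \frac{1}{2} \|a\|_*^2 + \frac{1}{2} \|b\|^2.
$$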
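As a quick illustration of the scaled Fenchel-Young bound on $U_1$, the following sketch evaluates the slack between the upper bound and the inner product for one concrete primal/dual pair. The choice of the $\ell_1$ norm as primal norm (hence $\ell_\infty$ as dual norm), the sample vectors, and the function names are assumptions made only for this example.

```python
def l1_norm(v):
    return sum(abs(t) for t in v)

def linf_norm(v):
    return max(abs(t) for t in v)

def young_slack(e, dx, eta):
    # Slack of <e, dx> <= ||e||_*^2 / (2 eta) + eta ||dx||^2 / 2,
    # with ||.|| = l1 (primal) and ||.||_* = l_inf (dual) as an illustrative pair.
    inner = sum(p * q for p, q in zip(e, dx))
    bound = linf_norm(e) ** 2 / (2.0 * eta) + eta * l1_norm(dx) ** 2 / 2.0
    return bound - inner

e = [1.0, -2.0, 0.5]     # stands in for the gradient error e(d(k))
dx = [0.3, -0.1, 0.4]    # stands in for x(k+1) - x(k)
slacks = [young_slack(e, dx, eta) for eta in (0.1, 1.0, 2.5, 10.0)]
```

The slack is nonnegative for every positive `eta`; the free parameter only trades the weight on the error term against the weight on the step length, which is what the later step-size choices exploit.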
We now turn to $U_2$. It follows from the definition $\tau(k) = k - d(k)$ that 

$$
U_2 = (k - d(k) + 1)^2 \, \Big\| \sum_{j=0}^{k-d(k)} \frac{x(k-j) - x(k-j+1)}{k - d(k) + 1} \Big\|^2
= (\tau(k) + 1)^2 \, \Big\| \sum_{j=0}^{\tau(k)} \frac{x(k-j) - x(k-j+1)}{\tau(k) + 1} \Big\|^2.
$$

Then, by the convexity of the squared norm $\|\cdot\|^2$, we conclude that 

$$
U_2 \le (\tau(k) + 1) \sum_{j=0}^{\tau(k)} \|x(k-j) - x(k-j+1)\|^2
\le (\tau_{\max} + 1) \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2,
\tag{13}
$$

where the last inequality comes from our assumption that $\tau(k) \le \tau_{\max}$ for all $k \in \mathbb{N}_0$. Substituting inequalities 
(12) and (13) into the bound (11) and simplifying yield 


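$$
\begin{aligned}
\phi(x(k+1)) - \phi(z) + \frac{1}{\gamma(k)} D_\omega(x(k+1), z)
&\le \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \langle e(d(k)),\, x(k) - z \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), z) \\
&\quad + \frac{L(\tau_{\max}+1)}{2} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{1}{2}\Big(\frac{1}{\gamma(k)} - \eta(k)\Big) \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|z - x(k+1)\|^2.
\end{aligned}
$$

Setting $z = x^\star$, where $x^\star \in X^\star$, completes the proof. 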
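The delay term in the proof rests on the bound (13), which controls the distance between the newest iterate and the delayed one by the intermediate steps. A minimal numerical sketch follows, assuming the Euclidean norm and a synthetic iterate sequence; the sequence and function names are hypothetical and not from the paper.

```python
import math

def sq_norm(v):
    return sum(t * t for t in v)

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def delay_bound_terms(xs, k, tau):
    # U2 = ||x(k+1) - x(k - tau)||^2 versus (tau+1) * sum_j ||x(k-j) - x(k-j+1)||^2.
    u2 = sq_norm(sub(xs[k + 1], xs[k - tau]))
    bound = (tau + 1) * sum(sq_norm(sub(xs[k - j], xs[k - j + 1])) for j in range(tau + 1))
    return u2, bound

# Synthetic iterates x(0), ..., x(11) in R^2 (an arbitrary deterministic sequence).
xs = [[math.sin(0.7 * t), math.cos(1.3 * t)] for t in range(12)]
pairs = [delay_bound_terms(xs, k, tau) for k in range(3, 11) for tau in range(0, 4)]
```

For `tau = 0` the two quantities coincide, which matches the synchronous case where no staleness penalty is incurred.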

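The next result follows from Lemma 1 by taking summation of the relations in (7). 

Lemma 2: Let the standing assumptions hold. Assume also that $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ is set to 

$$
\gamma(k)^{-1} = L(\tau_{\max}+1)^2 + \eta(k),
$$

where $\eta(k)$ is positive for all $k$. Then, the iterates produced by Algorithm 1 satisfy 

$$
\begin{aligned}
\sum_{k=0}^{T-1} \big(\phi(x(k+1)) - \phi^\star\big)
&\le \sum_{k=0}^{T-1} \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \sum_{k=0}^{T-1} \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(0)} D_\omega(x(0), x^\star) \\
&\quad + \sum_{k=0}^{T-1} \Big(\frac{1}{\gamma(k+1)} - \frac{1}{\gamma(k)}\Big) D_\omega(x(k+1), x^\star) - \frac{\mu_\Psi}{2} \sum_{k=0}^{T-1} \|x(k+1) - x^\star\|^2,
\end{aligned}
$$

for all $T \in \mathbb{N}$. 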
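Proof: Applying Lemma 1 with 

$$
\eta(k) = \frac{1}{\gamma(k)} - L(\tau_{\max}+1)^2,
$$

adding and subtracting $\gamma(k+1)^{-1} D_\omega(x(k+1), x^\star)$ to the left-hand side of (7), and re-arranging terms, we obtain 

$$
\begin{aligned}
\phi(x(k+1)) - \phi^\star + \frac{1}{\gamma(k+1)} D_\omega(x(k+1), x^\star)
&\le \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), x^\star) \\
&\quad + \Big(\frac{1}{\gamma(k+1)} - \frac{1}{\gamma(k)}\Big) D_\omega(x(k+1), x^\star) \\
&\quad + \frac{L(\tau_{\max}+1)}{2} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{L(\tau_{\max}+1)^2}{2} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \|x(k+1) - x^\star\|^2.
\end{aligned}
$$

Summing the preceding inequality over $k = 0, \ldots, T-1$, $T \in \mathbb{N}$, yields 

$$
\begin{aligned}
\sum_{k=0}^{T-1} \big(\phi(x(k+1)) - \phi^\star\big) + \frac{1}{\gamma(T)} D_\omega(x(T), x^\star)
&\le \sum_{k=0}^{T-1} \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \sum_{k=0}^{T-1} \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(0)} D_\omega(x(0), x^\star) \\
&\quad + \sum_{k=0}^{T-1} \Big(\frac{1}{\gamma(k+1)} - \frac{1}{\gamma(k)}\Big) D_\omega(x(k+1), x^\star) \\
&\quad + \frac{L(\tau_{\max}+1)}{2} \sum_{k=0}^{T-1} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{L(\tau_{\max}+1)^2}{2} \sum_{k=0}^{T-1} \|x(k+1) - x(k)\|^2 - \frac{\mu_\Psi}{2} \sum_{k=0}^{T-1} \|x(k+1) - x^\star\|^2 \\
&\le \sum_{k=0}^{T-1} \frac{1}{2\eta(k)} \|e(d(k))\|_*^2 + \sum_{k=0}^{T-1} \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(0)} D_\omega(x(0), x^\star) \\
&\quad + \sum_{k=0}^{T-1} \Big(\frac{1}{\gamma(k+1)} - \frac{1}{\gamma(k)}\Big) D_\omega(x(k+1), x^\star) - \frac{\mu_\Psi}{2} \sum_{k=0}^{T-1} \|x(k+1) - x^\star\|^2,
\end{aligned}
\tag{14}
$$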
where the second inequality used the facts 

$$
\begin{aligned}
\sum_{k=0}^{T-1} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2
&= \sum_{j=0}^{\tau_{\max}} \sum_{k=-j}^{T-j-1} \|x(k) - x(k+1)\|^2 \\
&= \sum_{j=0}^{\tau_{\max}} \sum_{k=0}^{T-j-1} \|x(k) - x(k+1)\|^2 \\
&\le \sum_{j=0}^{\tau_{\max}} \sum_{k=0}^{T-1} \|x(k) - x(k+1)\|^2 \\
&\le (\tau_{\max} + 1) \sum_{k=0}^{T-1} \|x(k) - x(k+1)\|^2,
\end{aligned}
$$

and $x(k) = x(0)$ for all $k < 0$. Dropping the second term on the left-hand side of (14) concludes the proof. ■ 
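The index-shifting argument behind the double-sum bound can be checked by direct enumeration. The sketch below uses scalar iterates and the convention $x(k) = x(0)$ for $k < 0$, so that steps with negative index contribute zero; the sequence and helper names are illustrative assumptions.

```python
def step_sq(xs, k):
    # ||x(k) - x(k+1)||^2 for scalar iterates, with x(k) = x(0) for k < 0,
    # so every step with a negative index has zero length.
    if k < 0:
        return 0.0
    return (xs[k] - xs[k + 1]) ** 2

def double_sum(xs, T, tau_max):
    # sum_{k=0}^{T-1} sum_{j=0}^{tau_max} ||x(k-j) - x(k-j+1)||^2
    return sum(step_sq(xs, k - j) for k in range(T) for j in range(tau_max + 1))

def single_sum(xs, T):
    # sum_{k=0}^{T-1} ||x(k) - x(k+1)||^2
    return sum(step_sq(xs, k) for k in range(T))

T = 10
xs = [((3 * t * t + 1) % 7) / 7.0 for t in range(T + 1)]  # arbitrary x(0), ..., x(T)
checks = [(double_sum(xs, T, tau), (tau + 1) * single_sum(xs, T)) for tau in range(0, 6)]
```

Each step length appears at most `tau_max + 1` times in the double sum, which is why the delayed terms can be absorbed by the negative $\|x(k+1) - x(k)\|^2$ term in (14).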
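Lemma 3: Let $\|\cdot\|$ be a norm over $\mathbb{R}^n$ and let $\|\cdot\|_*$ be its dual norm. Let $\omega$ be a 1-strongly convex function 
with respect to $\|\cdot\|$ over $\mathbb{R}^n$. If $y_1, \ldots, y_b \in \mathbb{R}^n$ are mean zero random variables drawn i.i.d. from a distribution 
$\mathcal{P}$, then 

$$
\mathbb{E}\Big[ \Big\| \frac{1}{b} \sum_{i=1}^{b} y_i \Big\|_*^2 \Big] \le \frac{c}{b^2} \sum_{i=1}^{b} \mathbb{E}\big[ \|y_i\|_*^2 \big],
$$

where $c \in [1, b]$ is given by 

$$
c = \begin{cases} 1, & \text{if } \|\cdot\|_* = \|\cdot\|_2, \\ 2 \max_{\|x\| \le 1} \omega(x), & \text{otherwise.} \end{cases}
$$

Proof: The result follows from [44, Lemma B.2] and convexity of the norm $\|\cdot\|_*$. For further details, see [32, §4.1]. 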
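In the Euclidean case ($c = 1$), the inequality of Lemma 3 holds with equality for i.i.d. zero-mean samples, which is the usual $1/b$ variance reduction of mini-batching. The sketch below verifies this by exact enumeration over a small, hypothetical finite distribution; the support points and function names are assumptions chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

def second_moment_of_mean(support, b):
    # Exact E|| (1/b) * sum_i y_i ||_2^2 for y_i drawn i.i.d. uniformly from `support`.
    dim = len(support[0])
    total = Fraction(0)
    for draw in product(support, repeat=b):
        mean = [Fraction(sum(v[d] for v in draw), b) for d in range(dim)]
        total += sum(m * m for m in mean)
    return total / len(support) ** b

def second_moment(support):
    # Exact E||y||_2^2 for y uniform on `support`.
    return Fraction(sum(c * c for v in support for c in v), len(support))

support = [(1, 0), (-1, 0), (0, 2), (0, -2)]  # a zero-mean distribution on R^2
b = 3
lhs = second_moment_of_mean(support, b)
rhs = second_moment(support) / b  # equals (c / b^2) * sum_i E||y_i||^2 with c = 1
```

For non-Euclidean norms the constant $c$ can exceed one, so this equality should be read as the best case of the lemma rather than the general statement.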

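A. Proof of Theorem 1 

Assume that the step-size $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ is set to 

$$
\gamma(k) = \gamma = \frac{1}{\eta + L(\tau_{\max}+1)^2}
$$

for some $\eta > 0$. It is clear that $\gamma$ satisfies the step-size condition of Lemma 2. Applying Lemma 2 with $\mu_\Psi = 0$, $\gamma(k) = \gamma$ and $\eta(k) = \eta$, we obtain 

$$
\sum_{k=0}^{T-1} \big(\phi(x(k+1)) - \phi^\star\big) \le \sum_{k=0}^{T-1} \frac{1}{2\eta} \|e(d(k))\|_*^2 + \sum_{k=0}^{T-1} \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{D_\omega(x(0), x^\star)}{\gamma}
\tag{15}
$$

for all $T \in \mathbb{N}$. Each $x(k)$, $k \in \mathbb{N}$, is a deterministic function of the history $\xi_{[k-1]} := \{\xi_i(t) \mid i = 1, \ldots, b,\; t = 0, \ldots, k-1\}$ but not of $\xi_i(k)$. Since $\nabla f(x) = \mathbb{E}_\xi[\nabla_x F(x, \xi)]$, it follows that 

$$
\mathbb{E}_{|\xi_{[k-1]}} \big[ \langle e(d(k)),\, x(k) - x^\star \rangle \big] = 0.
$$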
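Moreover, as $\xi_i$ and $\xi_j$ are independent whenever $i \ne j$, it follows from Lemma 3 that 

$$
\begin{aligned}
\mathbb{E}\big[ \|e(d(k))\|_*^2 \big]
&= \mathbb{E}\Big[ \Big\| \frac{1}{b} \sum_{i=1}^{b} \big( \nabla f(x(d(k))) - \nabla_x F(x(d(k)), \xi_i) \big) \Big\|_*^2 \Big] \\
&\le \frac{c}{b^2} \sum_{i=1}^{b} \mathbb{E}\big[ \| \nabla f(x(d(k))) - \nabla_x F(x(d(k)), \xi_i) \|_*^2 \big] \\
&\le \frac{c\sigma^2}{b},
\end{aligned}
$$

where the last inequality follows from the bounded gradient-variance assumption. Taking expectation on both sides of (15) and using the above 
observations yield 

$$
\sum_{k=1}^{T} \big( \mathbb{E}[\phi(x(k))] - \phi^\star \big) \le \frac{c\sigma^2}{2\eta b}\, T + \frac{D_\omega(x(0), x^\star)}{\gamma}.
$$

By the convexity of $\phi$, we have 

$$
\phi(x_{\mathrm{ave}}(T)) = \phi\Big( \frac{1}{T} \sum_{k=1}^{T} x(k) \Big) \le \frac{1}{T} \sum_{k=1}^{T} \phi(x(k)),
$$

which implies that 

$$
\mathbb{E}[\phi(x_{\mathrm{ave}}(T))] - \phi^\star \le \frac{c\sigma^2}{2\eta b} + \frac{D_\omega(x(0), x^\star)}{\gamma T}.
$$

Substituting $\eta = \gamma^{-1} - L(\tau_{\max}+1)^2$ into the above inequality proves the theorem. 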


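B. Proof of Theorem 2 

Assume that the step-size $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ is chosen such that $\gamma(k)^{-1} = L(\tau_{\max}+1)^2 + \alpha(k)$ where 

$$
\alpha(k) = \frac{\sigma \sqrt{c}\, \sqrt{k+1}}{R \sqrt{b}}.
$$

Since $\gamma(k)$ is a non-increasing sequence, and $D_\omega(x, y) \le R^2$ for all $x, y \in \operatorname{dom}\Psi$, we have 

$$
\sum_{k=0}^{T-1} \Big( \frac{1}{\gamma(k+1)} - \frac{1}{\gamma(k)} \Big) D_\omega(x(k+1), x^\star) \le R^2 \Big( \frac{1}{\gamma(T)} - \frac{1}{\gamma(0)} \Big).
$$

Applying Lemma 2 with $\mu_\Psi = 0$ and $\eta(k) = \alpha(k)$, taking expectation, and using Lemma 3 completely identically 
to the proof of Theorem 1, we then obtain 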




(16) 


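Viewing the sum as a lower estimate of the integral of the function $y(t) = 1/\sqrt{t+1}$, one can verify that 

$$
\sum_{k=0}^{T-1} \frac{1}{\alpha(k)} = \frac{1}{\alpha} \sum_{k=0}^{T-1} \frac{1}{\sqrt{k+1}}
\le \frac{1}{\alpha} \Big( 1 + \int_0^{T-1} \frac{dt}{\sqrt{t+1}} \Big) \le \frac{2\sqrt{T}}{\alpha},
$$

where $\alpha = (\sigma\sqrt{c})/(R\sqrt{b})$. Substituting this inequality into the bound (16), we obtain the claimed bound. 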
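The sum-versus-integral estimate used above is easy to confirm numerically. The following sketch compares the partial sums of $1/\sqrt{k+1}$ with the closed form of the integral bound, $1 + \int_0^{T-1} (t+1)^{-1/2}\,dt = 2\sqrt{T} - 1$, and with the looser bound $2\sqrt{T}$; the function names are illustrative.

```python
import math

def inv_sqrt_sum(T):
    # sum_{k=0}^{T-1} 1 / sqrt(k + 1)
    return sum(1.0 / math.sqrt(k + 1) for k in range(T))

def integral_bound(T):
    # 1 + integral_0^{T-1} dt / sqrt(t + 1) = 2 sqrt(T) - 1
    return 2.0 * math.sqrt(T) - 1.0

horizons = [1, 2, 5, 10, 100, 1000, 5000]
rows = [(T, inv_sqrt_sum(T), integral_bound(T), 2.0 * math.sqrt(T)) for T in horizons]
```

The $\sqrt{T}$ growth of this sum is what produces the $1/\sqrt{T}$ rate once the bound is divided by $T$.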
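C. Proof of Theorem 3 

Assume that the step-size $\{\gamma(k)\}_{k \in \mathbb{N}_0}$ in Algorithm 1 is set to $\gamma(k)^{-1} = 2L(\tau_{\max}+1)^2 + \beta(k)$, with 

$$
\beta(k) = \frac{\mu_\Psi}{3Q} (k + \tau_{\max} + 1).
$$

We first describe some important properties of $\gamma(k)$ relevant to our proof. Clearly, $\gamma(k)$ is non-increasing, i.e., 

$$
\frac{1}{\gamma(k)} \le \frac{1}{\gamma(k+1)},
\tag{17}
$$

for all $k \in \mathbb{N}_0$. Since $\gamma(0)^{-1} \le \gamma(k)^{-1}$, we have 

$$
2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} (\tau_{\max}+1) \le \frac{1}{\gamma(k)}.
\tag{18}
$$

Moreover, one can easily verify that 

$$
\begin{aligned}
\frac{1}{\gamma(k+1)^2} - \frac{1}{\gamma(k)^2}
&= \frac{\mu_\Psi}{Q} \Big( \frac{4L}{3} (\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} \Big( \frac{2}{3}(k + \tau_{\max} + 1) + \frac{1}{3} \Big) \Big) \\
&\le \frac{\mu_\Psi}{Q} \Big( 2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} (k + \tau_{\max} + 1) \Big) \\
&= \frac{\mu_\Psi}{Q} \frac{1}{\gamma(k)},
\end{aligned}
$$

which implies that 

$$
\frac{1}{\gamma(k+1)^2} \le \frac{1}{\gamma(k)} \Big( \frac{1}{\gamma(k)} + \frac{\mu_\Psi}{Q} \Big)
\tag{19}
$$

for all $k \in \mathbb{N}_0$. Finally, by the definition of $\gamma(k)$, we have 

$$
\begin{aligned}
\frac{\gamma(k)}{\gamma(k + \tau_{\max})}
&= \frac{2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} (k + 2\tau_{\max} + 1)}{2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} (k + \tau_{\max} + 1)} \\
&= 1 + \frac{\frac{\mu_\Psi}{3Q} \tau_{\max}}{2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q} (k + \tau_{\max} + 1)} \\
&\le 1 + \frac{\mu_\Psi \tau_{\max}}{6LQ(\tau_{\max}+1)^2},
\end{aligned}
$$

and hence, 

$$
\frac{1}{\gamma(k + \tau_{\max})} \le \Big( 1 + \frac{\mu_\Psi \tau_{\max}}{6LQ(\tau_{\max}+1)^2} \Big) \frac{1}{\gamma(k)}.
\tag{20}
$$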
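The four step-size properties can be checked with exact rational arithmetic. The sketch below assumes the rule $\gamma(k)^{-1} = 2L(\tau_{\max}+1)^2 + \frac{\mu_\Psi}{3Q}(k + \tau_{\max} + 1)$, which is a reconstruction from the surrounding derivation rather than a verbatim quote of the source; the parameter values and function names are hypothetical.

```python
from fractions import Fraction

def gamma_inv(k, L, mu, Q, tau_max):
    # Assumed rule: 1/gamma(k) = 2 L (tau_max+1)^2 + (mu / (3 Q)) (k + tau_max + 1).
    return 2 * L * (tau_max + 1) ** 2 + Fraction(mu, 3 * Q) * (k + tau_max + 1)

def step_size_properties_hold(k, L, mu, Q, tau_max):
    g = lambda t: gamma_inv(t, L, mu, Q, tau_max)
    # 1/gamma is non-decreasing in k.
    monotone = g(k) <= g(k + 1)
    # 1/gamma(k) is bounded below by its value at k = 0.
    floor = 2 * L * (tau_max + 1) ** 2 + Fraction(mu * (tau_max + 1), 3 * Q) <= g(k)
    # Squared growth: 1/gamma(k+1)^2 <= (1/gamma(k)) (1/gamma(k) + mu/Q).
    squares = g(k + 1) ** 2 <= g(k) * (g(k) + Fraction(mu, Q))
    # Delayed ratio: 1/gamma(k+tau_max) <= (1 + mu tau_max / (6 L Q (tau_max+1)^2)) / gamma(k).
    factor = 1 + Fraction(mu * tau_max, 6 * L * Q * (tau_max + 1) ** 2)
    delayed = g(k + tau_max) <= factor * g(k)
    return monotone and floor and squares and delayed

# Integer parameters keep every comparison exact (no floating-point rounding).
all_ok = all(step_size_properties_hold(k, L=2, mu=1, Q=3, tau_max=4) for k in range(200))
```

Exact fractions are used so that the boundary cases of the inequalities are not blurred by rounding.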

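We are now ready to prove Theorem 3. Applying Lemma 1 with 

$$
\eta(k) = \frac{1}{2\gamma(k)}, \quad k \in \mathbb{N}_0,
$$

and using the fact 

$$
D_\omega(x(k+1), x^\star) \le \frac{Q}{2} \|x(k+1) - x^\star\|^2,
$$

which holds by assumption, we obtain 

$$
\begin{aligned}
\phi(x(k+1)) - \phi^\star + \Big( \frac{1}{\gamma(k)} + \frac{\mu_\Psi}{Q} \Big) D_\omega(x(k+1), x^\star)
&\le \gamma(k) \|e(d(k))\|_*^2 + \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(k)} D_\omega(x(k), x^\star) \\
&\quad + \frac{L(\tau_{\max}+1)}{2} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{1}{4\gamma(k)} \|x(k+1) - x(k)\|^2.
\end{aligned}
$$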
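Multiplying both sides of this relation by $1/\gamma(k)$ and then using (19), we have 

$$
\begin{aligned}
\frac{1}{\gamma(k)} \big( \phi(x(k+1)) - \phi^\star \big) + \frac{1}{\gamma(k+1)^2} D_\omega(x(k+1), x^\star)
&\le \|e(d(k))\|_*^2 + \frac{1}{\gamma(k)} \langle e(d(k)),\, x(k) - x^\star \rangle + \frac{1}{\gamma(k)^2} D_\omega(x(k), x^\star) \\
&\quad + \frac{L(\tau_{\max}+1)}{2\gamma(k)} \sum_{j=0}^{\tau_{\max}} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{1}{4\gamma(k)^2} \|x(k+1) - x(k)\|^2.
\end{aligned}
$$

Summing the above inequality from $k = 0$ to $k = T-1$, $T \in \mathbb{N}$, and dropping the first term on the left-hand side 
yield 

$$
\begin{aligned}
\frac{1}{\gamma(T)^2} D_\omega(x(T), x^\star)
&\le \sum_{k=0}^{T-1} \|e(d(k))\|_*^2 + \sum_{k=0}^{T-1} \frac{1}{\gamma(k)} \langle e(d(k)),\, x(k) - x^\star \rangle \\
&\quad + \frac{L(\tau_{\max}+1)}{2} \sum_{k=0}^{T-1} \sum_{j=0}^{\tau_{\max}} \frac{1}{\gamma(k)} \|x(k-j) - x(k-j+1)\|^2 \\
&\quad - \frac{1}{4} \sum_{k=0}^{T-1} \frac{1}{\gamma(k)^2} \|x(k+1) - x(k)\|^2 + \frac{1}{\gamma(0)^2} D_\omega(x(0), x^\star).
\end{aligned}
\tag{21}
$$

What remains is to bound the third term on the right-hand side of (21). It follows from (17)–(20) that 

$$
\begin{aligned}
\frac{L(\tau_{\max}+1)}{2} \sum_{k=0}^{T-1} \sum_{j=0}^{\tau_{\max}} \frac{1}{\gamma(k)} \|x(k-j) - x(k-j+1)\|^2
&= \frac{L(\tau_{\max}+1)}{2} \sum_{j=0}^{\tau_{\max}} \sum_{k=-j}^{T-j-1} \frac{1}{\gamma(k+j)} \|x(k) - x(k+1)\|^2 \\
&\le \frac{L(\tau_{\max}+1)}{2} \sum_{j=0}^{\tau_{\max}} \sum_{k=0}^{T-1} \frac{1}{\gamma(k+\tau_{\max})} \|x(k) - x(k+1)\|^2 \\
&\le \frac{L(\tau_{\max}+1)^2}{2} \Big( 1 + \frac{\mu_\Psi \tau_{\max}}{6LQ(\tau_{\max}+1)^2} \Big) \sum_{k=0}^{T-1} \frac{1}{\gamma(k)} \|x(k) - x(k+1)\|^2 \\
&\le \frac{1}{4} \sum_{k=0}^{T-1} \frac{1}{\gamma(k)^2} \|x(k) - x(k+1)\|^2.
\end{aligned}
$$

Substituting the above inequality into (21), and then taking expectation on both sides (similarly to the proof of 
Theorems 1 and 2), we have 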




( 22 ) 


According to Remark 1, 

$$
\frac{1}{2} \|x(T) - x^\star\|^2 \le D_\omega(x(T), x^\star).
$$
Moreover, by the definition of $\gamma(k)$, 




sg 

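Combining these inequalities with the bound (22), we conclude 

$$
\mathbb{E}\big[ \|x(T) - x^\star\|^2 \big] \le \frac{18 c \sigma^2 Q^2}{b \mu_\Psi^2 (T+1)} + \frac{2 \Big( \dfrac{6LQ(\tau_{\max}+1)^2}{\mu_\Psi} + \tau_{\max} + 1 \Big)^2}{(T+1)^2} D_\omega(x(0), x^\star).
$$

The proof is complete. 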